17.5.2 Hidden Markov Models

Knowledge of the actual biological sequence of processing operations can be used to exploit the constraints that these successive processes impose on the (nucleic acid) sequence. One presumes that the Markov binary symbol transition matrices are slightly different for introns, exons, promoters, enhancers, the complementary strand, and so forth. One constructs a more elaborate automaton, an automaton of automata, in which the outer one controls the transitions between the different types of DNA (introns, exons, etc.) and the inner set gives, for each type, the 16 different binary transition probabilities for the symbol sequence. More sophisticated models use higher-order chains for the symbol transitions; further levels of automata can also be introduced. The epithet “hidden” is intended to signify that only transitions from symbol to symbol are observable, not transitions from type to type. The main problem is the statistical inadequacy of the predictions: the available data are too sparse to estimate the parameters reliably. A promoter may only have two dozen bases, whereas a fourth-order Markov chain for nucleotides has of the order of $10^{3}$ transition probabilities ($4^{4} \times 4 = 1024$).
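The "automaton of automata" described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the two DNA types (exon, intron), every probability value, and the names `OUTER`, `INNER`, `sample`, and `n_transition_probs` are hypothetical and are not taken from the text or from any fitted model. The outer automaton switches between types; the inner automata are first-order chains, each with 16 symbol transition probabilities.

```python
import random

BASES = "ACGT"

def n_transition_probs(order, alphabet_size=4):
    """Count the transition probabilities of a Markov chain of the given
    order: one for each (context of `order` symbols, next symbol) pair."""
    return alphabet_size ** (order + 1)

# Outer automaton (hypothetical values): transitions between DNA types.
OUTER = {
    "exon":   {"exon": 0.99, "intron": 0.01},
    "intron": {"exon": 0.02, "intron": 0.98},
}

# Inner automata (hypothetical values): for each type, a 4 x 4 matrix of
# symbol transition probabilities. Each row is indexed by the current
# base and lists the probabilities of the next base in the order A, C, G, T.
INNER = {
    "exon": {
        "A": [0.30, 0.25, 0.30, 0.15],
        "C": [0.25, 0.30, 0.25, 0.20],
        "G": [0.30, 0.25, 0.30, 0.15],
        "T": [0.15, 0.30, 0.35, 0.20],
    },
    "intron": {
        "A": [0.35, 0.15, 0.15, 0.35],
        "C": [0.30, 0.20, 0.10, 0.40],
        "G": [0.30, 0.20, 0.20, 0.30],
        "T": [0.25, 0.15, 0.20, 0.40],
    },
}

def sample(length, seed=0):
    """Generate `length` (>= 1) symbols together with the hidden types.

    An observer of real DNA sees only the symbols; the sequence of types
    is the hidden part that the model is meant to infer."""
    rng = random.Random(seed)
    state = "exon"                 # arbitrary starting type
    symbol = rng.choice(BASES)     # arbitrary starting base
    symbols, states = [symbol], [state]
    for _ in range(length - 1):
        # Outer step: possibly switch to another type of DNA.
        types = list(OUTER[state])
        state = rng.choices(types, weights=[OUTER[state][t] for t in types])[0]
        # Inner step: draw the next base using the current type's matrix.
        symbol = rng.choices(BASES, weights=INNER[state][symbol])[0]
        symbols.append(symbol)
        states.append(state)
    return "".join(symbols), states

if __name__ == "__main__":
    seq, types = sample(60)
    print(seq)
    print("".join(t[0] for t in types))   # e = exon, i = intron
    for order in (1, 2, 4):
        print(order, n_transition_probs(order))
```

With these definitions, `n_transition_probs(1)` gives the 16 probabilities of each inner matrix and `n_transition_probs(4)` gives 1024, which illustrates why a region of about two dozen bases cannot support a fourth-order model.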

Problem. Construct a hidden Markov model for the mitogen-activated protein kinase signalling cascade (Sect. 18.7).

17.6 Minimalist Approaches to Deciphering DNA

The inspiration for this approach is the study of texts written in human languages. A powerful motivation for the development of linguistics as a formal field of inquiry was the desire to understand texts written in “lost” languages (without living speakers), especially those of antiquity, records of which began pouring into Europe as a result of the large-scale expeditions to Egypt, Mesopotamia, and elsewhere undertaken in the nineteenth and twentieth centuries. More recently, linguistics has been driven by attempts to automatically translate texts written in one language into another.

One of the most obvious differences between DNA sequences and texts written in living languages is that the former lack separators between the words (denoted by spaces in most of the latter). Furthermore, unambiguous punctuation marks generally enable phrases and sentences in living languages to be clearly identified. Even with this invaluable information, however, matters are far from determined, and the study of the morphology of words and the rules that determine their association into sentences (syntax), which together constitute grammar, is a large and active research field.

For DNA that is ultimately translated into protein sequences, the nucleic acid base pairs are grouped into triplets constituting the reading frames, each triplet corresponding to one amino acid. A further peculiarity of DNA compared with human languages is that reading frames may overlap; that is, from the sequence AAGTTCTG… one may derive the triplets AAG, AGT, GTT, TTC, …. This is encountered in certain